    Towards Effective Disambiguation for Machine Translation with Large Language Models

    Resolving semantic ambiguity has long been recognised as a central challenge in the field of Machine Translation. Recent work on benchmarking translation performance on ambiguous sentences has exposed the limitations of conventional Neural Machine Translation (NMT) systems, which fail to handle many such cases. Large language models (LLMs) have emerged as a promising alternative, demonstrating comparable performance to traditional NMT models while introducing new paradigms for controlling the target outputs. In this paper, we study the capabilities of LLMs to translate "ambiguous sentences" - i.e. those containing highly polysemous words and/or rare word senses. We also propose two ways to improve their disambiguation capabilities, through a) in-context learning and b) fine-tuning on carefully curated ambiguous datasets. Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions. Our research provides valuable insights into effectively adapting LLMs to become better disambiguators during Machine Translation. We release our curated disambiguation corpora and resources at https://data.statmt.org/ambiguous-europarl
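
    As an illustration of the in-context learning route described above, here is a minimal, hedged Python sketch of how few-shot demonstrations with sense-disambiguating translations could be assembled into a prompt. The demonstration pairs, the language pair, and the idea of sending the string to a generic LLM endpoint are illustrative assumptions, not details from the paper.

    # Hedged sketch: few-shot prompt construction for disambiguation-aware MT.
    # The demonstration pairs below are illustrative, not from the released corpora.
    FEW_SHOT = [
        # Each pair resolves a different sense of the polysemous word "bank".
        ("The fisherman sat on the bank.", "Der Fischer sass am Ufer."),
        ("She deposited cash at the bank.", "Sie zahlte Bargeld bei der Bank ein."),
    ]

    def build_prompt(source: str, src: str = "English", tgt: str = "German") -> str:
        """Assemble an in-context learning prompt from sense-disambiguating examples."""
        parts = [f"Translate from {src} to {tgt}, choosing the correct sense of ambiguous words."]
        for s, t in FEW_SHOT:
            parts.append(f"{src}: {s}\n{tgt}: {t}")
        parts.append(f"{src}: {source}\n{tgt}:")
        return "\n\n".join(parts)

    # The resulting string can be sent to any instruction-tuned LLM endpoint.
    print(build_prompt("He walked along the bank."))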

    VeeAlign: Multifaceted Context Representation Using Dual Attention for Ontology Alignment

    Ontology Alignment is an important research problem applied to fields such as data integration, data transfer, and data preparation. State-of-the-art (SOTA) Ontology Alignment systems typically use naive domain-dependent approaches with handcrafted rules or domain-specific architectures, making them unscalable and inefficient. In this work, we propose VeeAlign, a Deep Learning based model that uses a novel dual-attention mechanism to compute the contextualized representation of a concept, which, in turn, is used to discover alignments. By doing this, not only is our approach able to exploit both syntactic and semantic information encoded in ontologies, but it is also, by design, flexible and scalable to different domains with minimal effort. We evaluate our model on four datasets from different domains and languages, and establish its superiority through these results as well as detailed ablation studies. The code and datasets used are available at https://github.com/Remorax/VeeAlign (Comment: duplicate of arXiv:2010.1172)
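
    To make the dual-attention idea concrete, here is a hedged PyTorch sketch of one way a concept's representation could be contextualized by attending separately over two context views (e.g. semantic neighbours and path/ancestor nodes) and fusing the results. The module layout, the cosine-similarity alignment score, and all names are assumptions made for illustration; VeeAlign's actual architecture should be taken from the paper and repository.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualAttention(nn.Module):
        """Sketch: pool two context views of a concept with separate attention heads."""
        def __init__(self, dim: int):
            super().__init__()
            self.sem_query = nn.Linear(dim, dim)   # query projection for semantic context
            self.syn_query = nn.Linear(dim, dim)   # query projection for syntactic context
            self.mix = nn.Linear(2 * dim, dim)     # fuses the two pooled views

        @staticmethod
        def attend(query: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
            # ctx: (n, dim), query: (dim,). Scaled dot-product attention pooling.
            weights = F.softmax(ctx @ query / ctx.size(-1) ** 0.5, dim=0)
            return weights @ ctx

        def forward(self, concept, sem_ctx, syn_ctx):
            sem = self.attend(self.sem_query(concept), sem_ctx)
            syn = self.attend(self.syn_query(concept), syn_ctx)
            return torch.tanh(self.mix(torch.cat([sem, syn], dim=-1)))

    # A simple alignment score between two contextualized concepts (an assumption):
    def alignment_score(enc, a, sem_a, syn_a, b, sem_b, syn_b):
        return F.cosine_similarity(enc(a, sem_a, syn_a), enc(b, sem_b, syn_b), dim=-1)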

    Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

    Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generated the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can lead to significant noise in a variety of cases, including the poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and the inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a 'base' NMT model. We conduct experiments on 3 different language families - Romance, Uralic, and Indo-Aryan - and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the amount of data provided. We empirically analyse several key factors responsible for these gains, including context, many-to-many substitutions, and the number of code-switched languages, and show that they all contribute to enhanced pretraining of multilingual NMT models.
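
    The mechanics of CCS-style noising can be sketched in a few lines of Python. The interface below, in which a base model translates a multi-word span given its full sentence context, is an assumption made for illustration; the span sampling, switch probability, and the toy dictionary stand-in are likewise not from the paper.

    import random

    def contextual_code_switch(tokens, translate_span, switch_prob=0.3, max_span=3):
        """Replace random spans with contextual translations to synthesize code-switched data."""
        out, i = [], 0
        while i < len(tokens):
            if random.random() < switch_prob:
                span_len = random.randint(1, max_span)
                # A base NMT model would translate the span *in context*,
                # allowing many-to-many substitutions and multi-word expressions.
                out.extend(translate_span(tokens[i:i + span_len], context=tokens))
                i += span_len
            else:
                out.append(tokens[i])
                i += 1
        return out

    # Toy usage with a dictionary standing in for the base NMT model:
    toy = {("good", "morning"): ["guten", "Morgen"], ("good",): ["gut"],
           ("morning",): ["Morgen"], ("everyone",): ["alle"]}
    translate = lambda span, context: toy.get(tuple(span), list(span))
    print(contextual_code_switch("good morning everyone".split(), translate))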

    Ancient conserved domains shared by animal soluble guanylyl cyclases and bacterial signaling proteins

    BACKGROUND: Soluble guanylyl cyclases (SGCs) are dimeric enzymes that transduce signals downstream of nitric oxide (NO) in animals. They sense NO by means of a heme moiety that is bound to their N-terminal extensions. RESULTS: Using sequence profile searches, we show that the N-terminal extensions of the SGCs contain two globular domains. The first of these, the HNOB (Heme NO Binding) domain, is a predominantly α-helical domain and binds heme via a covalent linkage to histidine. Versions lacking this conserved histidine are likely to interact with heme non-covalently. We detected HNOB domains in several bacterial lineages, where they occur fused to methyl-accepting domains of chemotaxis receptors or as standalone proteins. The standalone forms are encoded by predicted operons that also contain genes for two-component signaling systems and GGDEF-type nucleotide cyclases. The second domain, the HNOB-associated (HNOBA) domain, occurs between the HNOB and the cyclase domains in the animal SGCs. The HNOBA domain is also detected in bacteria and is always encoded by a gene that occurs in the neighborhood of a gene for an HNOB domain. CONCLUSION: The HNOB domain is predicted to function as a heme-dependent sensor for gaseous ligands and to transduce diverse downstream signals in both bacteria and animals. The HNOBA domain functionally interacts with the HNOB domain, and possibly binds a ligand, either in cooperation with, or independently of, the latter domain. Phyletic profiles and phylogenetic analysis suggest that the HNOB and HNOBA domains were acquired by the animal lineage via lateral transfer from a bacterial source.

    Noether Currents of Charged Spherical Black Holes

    We calculate the Noether currents and charges for Einstein-Maxwell theory using a version of the Wald approach. In spherical symmetry, the choice of time can be taken as the Kodama vector. For the static case, the resulting combined Einstein-Maxwell charge is just the mass of the black hole. Using either a classically defined entropy or the Iyer-Wald selection rules, the entropy is found to be just a quarter of the area of the trapping horizon. We propose identifying the combined Noether charge as an energy associated with the Kodama time. For the extremal black hole case, we discuss the problem of Wald's rescaling of the surface gravity to define the entropy. (Comment: 4 pages)
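
    For orientation, the quarter-area result quoted above is the standard outcome when Wald's Noether-charge entropy is evaluated for Einstein gravity. A minimal LaTeX statement of that relation is sketched below, in geometric units G = c = hbar = k_B = 1 and with A read as the area of the trapping horizon per the abstract; the general formula and its specialization are textbook results, not this paper's derivation.

    % Wald's Noether-charge entropy, and its specialization to Einstein gravity,
    % where it reduces to the Bekenstein-Hawking area law (units G = c = \hbar = k_B = 1):
    \begin{equation}
      S_{\mathrm{Wald}}
        = -2\pi \oint_{\mathcal{H}} \frac{\partial \mathcal{L}}{\partial R_{abcd}}
          \,\epsilon_{ab}\,\epsilon_{cd}\,\mathrm{d}A,
      \qquad
      S_{\mathrm{Einstein}} = \frac{A}{4}.
    \end{equation}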